White Wine Quality Exploration by Yuan Shi

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Plots Section

Most wines have a quality score around 6 (median 6, mean 5.878).

Most wines have a fixed.acidity around 7 (median 6.8).

The volatile.acidity data is long-tailed. I used log10 to transform it; the transformed distribution appears unimodal, peaking near the median of 0.26.

Most wines have a citric.acid value around 0.3 (median 0.32). Citric acid is found in small quantities and can add ‘freshness’ and flavor to wines.

The residual.sugar data is long-tailed. I used log10 to transform it; the transformed distribution appears bimodal, peaking around 2 and again around 9.
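The effect of a log10 transform on a long-tailed, two-group variable can be sketched in base R. This is only an illustration: the `sugar` vector below is synthetic (a mixture of two log-normal groups centred near 2 and 9), not the wine data, and `hist()` is called with `plot = FALSE` so the bin counts can be inspected directly.

```r
# Synthetic stand-in for residual.sugar (assumed values, not the wine data):
# a mixture of two log-normal groups centred near 2 and 9.
set.seed(42)
sugar <- c(rlnorm(600, meanlog = log(2), sdlog = 0.3),
           rlnorm(400, meanlog = log(9), sdlog = 0.3))

# Bin the raw and the log10-transformed values without drawing a plot.
raw_hist <- hist(sugar, breaks = 30, plot = FALSE)
log_hist <- hist(log10(sugar), breaks = 30, plot = FALSE)

# On the raw scale the mean sits above the median (long right tail).
c(mean = mean(sugar), median = median(sugar))

# Back-transform the bin midpoints so the log-scale counts can be read
# in the original units.
data.frame(sugar = round(10^log_hist$mids, 2), count = log_hist$counts)
```

On the log scale the two groups occupy separate runs of bins, which is what makes a second mode visible.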

The chlorides data is long-tailed. I used log10 to transform it; the transformed distribution appears unimodal, peaking around 0.04 (median 0.043).

Free.sulfur.dioxide is skewed to the right (maximum 289 against a median of 34). Most wines have a free.sulfur.dioxide of about 30.
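A quick rule of thumb for the direction of skew is to compare the mean with the median: a mean pulled above the median points to a long right tail. The helper `skew_direction()` below is hypothetical, the input vector is made up, and the heuristic is only a rough check rather than a formal skewness statistic.

```r
# Hypothetical helper: report the skew direction from mean vs. median.
skew_direction <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (m > med) "right" else if (m < med) "left" else "symmetric"
}

# Made-up values with one large observation in the upper tail.
x <- c(2, 14, 23, 30, 34, 38, 46, 60, 289)
skew_direction(x)
```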

Most wines have a total.sulfur.dioxide between 100 mg/dm^3 and 150 mg/dm^3: median 134.0 mg/dm^3 and mean 138.4 mg/dm^3.

Most wines have a density between 0.992 g/cm^3 and 0.995 g/cm^3: median 0.9937 g/cm^3 and mean 0.994 g/cm^3.

Most wines have a pH between 3.1 and 3.3: median 3.180 and mean 3.188.

Most wines have an alcohol value between 9 and 11.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in the dataset with 13 features (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality).

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to determine which features are best for predicting the quality of wines. I suspect alcohol and some combination of the other variables can be used to build a predictive model for wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Fixed.acidity, volatile.acidity, chlorides, total.sulfur.dioxide, density, pH and alcohol likely contribute to the quality of wines. After researching information on wine quality, I think alcohol and density probably contribute most.

Did you create any new variables from existing variables in the dataset?

When free SO2 concentrations are over 50 ppm, SO2 becomes evident in the nose and taste of wine. I created a variable, free.sulfur.dioxide_over50ppm, that measures how far each wine’s free sulfur dioxide sits from that threshold: the current free sulfur dioxide value minus 50. Positive values mark wines above 50 ppm.

I want to find out whether bound sulfur dioxide has an impact on wine quality, so I also created a variable for the bound sulfur dioxide of wines. Since total sulfur dioxide is the sum of the free and bound forms of SO2, the difference between total sulfur dioxide and free sulfur dioxide is the bound sulfur dioxide.
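The two derived variables can be sketched as follows. The data frame holds only the first ten rows printed by `str()` above, and the column names match those in the correlation output further down; building the frame by hand like this is an assumption made for illustration, not necessarily how the original analysis loaded the data.

```r
# First ten rows of the two sulfur dioxide columns, copied from the str() output.
wq <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Free SO2 relative to the 50 ppm threshold (negative means below it).
wq$free.sulfur.dioxide_over50ppm <- wq$free.sulfur.dioxide - 50

# Bound SO2 is what remains of the total after removing the free form.
wq$bound.sulfur.dioxide <- wq$total.sulfur.dioxide - wq$free.sulfur.dioxide

head(wq, 3)
```

Because the first variable is a constant shift of free.sulfur.dioxide, the two are perfectly correlated, so it carries no extra information for a correlation-based analysis.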

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the right-skewed residual.sugar distribution. The transformed distribution appears bimodal, with residual.sugar peaking around 2 gram/liter and again around 9 gram/liter.

Bivariate Plots Section

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                                      X fixed.acidity volatile.acidity
## X                                    1       Pearson          Pearson
## fixed.acidity                  -0.2558             1          Pearson
## volatile.acidity              0.002858       -0.0227                1
## citric.acid                    -0.1499        0.2892          -0.1495
## residual.sugar                0.006624       0.08902          0.06429
## chlorides                     -0.04565       0.02309          0.07051
## free.sulfur.dioxide           -0.01193       -0.0494         -0.09701
## total.sulfur.dioxide            -0.162       0.09107          0.08926
## density                         -0.186        0.2653          0.02711
## pH                             -0.1158       -0.4259         -0.03192
## sulphates                     0.009808      -0.01714         -0.03573
## alcohol                         0.2137       -0.1209          0.06772
## quality                        0.03576       -0.1137          -0.1947
## free.sulfur.dioxide_over50ppm -0.01193       -0.0494         -0.09701
## bound.sulfur.dioxide           -0.1924        0.1357           0.1568
##                               citric.acid residual.sugar chlorides
## X                                 Pearson        Pearson   Pearson
## fixed.acidity                     Pearson        Pearson   Pearson
## volatile.acidity                  Pearson        Pearson   Pearson
## citric.acid                             1        Pearson   Pearson
## residual.sugar                    0.09421              1   Pearson
## chlorides                          0.1144        0.08868         1
## free.sulfur.dioxide               0.09408         0.2991    0.1014
## total.sulfur.dioxide               0.1211         0.4014    0.1989
## density                            0.1495          0.839    0.2572
## pH                                -0.1637        -0.1941  -0.09044
## sulphates                         0.06233       -0.02666   0.01676
## alcohol                          -0.07573        -0.4506   -0.3602
## quality                         -0.009209       -0.09758   -0.2099
## free.sulfur.dioxide_over50ppm     0.09408         0.2991    0.1014
## bound.sulfur.dioxide               0.1022         0.3448    0.1938
##                               free.sulfur.dioxide total.sulfur.dioxide
## X                                         Pearson              Pearson
## fixed.acidity                             Pearson              Pearson
## volatile.acidity                          Pearson              Pearson
## citric.acid                               Pearson              Pearson
## residual.sugar                            Pearson              Pearson
## chlorides                                 Pearson              Pearson
## free.sulfur.dioxide                             1              Pearson
## total.sulfur.dioxide                       0.6155                    1
## density                                    0.2942               0.5299
## pH                                     -0.0006178             0.002321
## sulphates                                 0.05922               0.1346
## alcohol                                   -0.2501              -0.4489
## quality                                  0.008158              -0.1747
## free.sulfur.dioxide_over50ppm                   1               0.6155
## bound.sulfur.dioxide                       0.2635               0.9225
##                                density         pH sulphates alcohol
## X                              Pearson    Pearson   Pearson Pearson
## fixed.acidity                  Pearson    Pearson   Pearson Pearson
## volatile.acidity               Pearson    Pearson   Pearson Pearson
## citric.acid                    Pearson    Pearson   Pearson Pearson
## residual.sugar                 Pearson    Pearson   Pearson Pearson
## chlorides                      Pearson    Pearson   Pearson Pearson
## free.sulfur.dioxide            Pearson    Pearson   Pearson Pearson
## total.sulfur.dioxide           Pearson    Pearson   Pearson Pearson
## density                              1    Pearson   Pearson Pearson
## pH                            -0.09359          1   Pearson Pearson
## sulphates                      0.07449      0.156         1 Pearson
## alcohol                        -0.7801     0.1214  -0.01743       1
## quality                        -0.3071    0.09943   0.05368  0.4356
## free.sulfur.dioxide_over50ppm   0.2942 -0.0006178   0.05922 -0.2501
## bound.sulfur.dioxide            0.5044   0.003143    0.1357 -0.4269
##                                quality free.sulfur.dioxide_over50ppm
## X                              Pearson                       Pearson
## fixed.acidity                  Pearson                       Pearson
## volatile.acidity               Pearson                       Pearson
## citric.acid                    Pearson                       Pearson
## residual.sugar                 Pearson                       Pearson
## chlorides                      Pearson                       Pearson
## free.sulfur.dioxide            Pearson                       Pearson
## total.sulfur.dioxide           Pearson                       Pearson
## density                        Pearson                       Pearson
## pH                             Pearson                       Pearson
## sulphates                      Pearson                       Pearson
## alcohol                        Pearson                       Pearson
## quality                              1                       Pearson
## free.sulfur.dioxide_over50ppm 0.008158                             1
## bound.sulfur.dioxide           -0.2179                        0.2635
##                               bound.sulfur.dioxide
## X                                          Pearson
## fixed.acidity                              Pearson
## volatile.acidity                           Pearson
## citric.acid                                Pearson
## residual.sugar                             Pearson
## chlorides                                  Pearson
## free.sulfur.dioxide                        Pearson
## total.sulfur.dioxide                       Pearson
## density                                    Pearson
## pH                                         Pearson
## sulphates                                  Pearson
## alcohol                                    Pearson
## quality                                    Pearson
## free.sulfur.dioxide_over50ppm              Pearson
## bound.sulfur.dioxide                             1
## 
## Standard Errors:
##                                     X fixed.acidity volatile.acidity
## X                                                                   
## fixed.acidity                 0.01336                               
## volatile.acidity              0.01429       0.01428                 
## citric.acid                   0.01397        0.0131          0.01397
## residual.sugar                0.01429       0.01418          0.01423
## chlorides                     0.01426       0.01428          0.01422
## free.sulfur.dioxide           0.01429       0.01426          0.01416
## total.sulfur.dioxide          0.01392       0.01417          0.01418
## density                        0.0138       0.01328          0.01428
## pH                             0.0141        0.0117          0.01428
## sulphates                     0.01429       0.01429          0.01427
## alcohol                       0.01364       0.01408          0.01422
## quality                       0.01427       0.01411          0.01375
## free.sulfur.dioxide_over50ppm 0.01429       0.01426          0.01416
## bound.sulfur.dioxide          0.01376       0.01403          0.01394
##                               citric.acid residual.sugar chlorides
## X                                                                 
## fixed.acidity                                                     
## volatile.acidity                                                  
## citric.acid                                                       
## residual.sugar                    0.01416                         
## chlorides                          0.0141        0.01418          
## free.sulfur.dioxide               0.01416        0.01301   0.01414
## total.sulfur.dioxide              0.01408        0.01199   0.01372
## density                           0.01397       0.004233   0.01334
## pH                                0.01391        0.01375   0.01417
## sulphates                         0.01423        0.01428   0.01429
## alcohol                           0.01421        0.01139   0.01244
## quality                           0.01429        0.01415   0.01366
## free.sulfur.dioxide_over50ppm     0.01416        0.01301   0.01414
## bound.sulfur.dioxide              0.01414        0.01259   0.01375
##                               free.sulfur.dioxide total.sulfur.dioxide
## X                                                                     
## fixed.acidity                                                         
## volatile.acidity                                                      
## citric.acid                                                           
## residual.sugar                                                        
## chlorides                                                             
## free.sulfur.dioxide                                                   
## total.sulfur.dioxide                     0.008878                     
## density                                   0.01305              0.01028
## pH                                        0.01429              0.01429
## sulphates                                 0.01424              0.01403
## alcohol                                    0.0134              0.01141
## quality                                   0.01429              0.01385
## free.sulfur.dioxide_over50ppm                   0             0.008878
## bound.sulfur.dioxide                       0.0133              0.00213
##                                density      pH sulphates alcohol quality
## X                                                                       
## fixed.acidity                                                           
## volatile.acidity                                                        
## citric.acid                                                             
## residual.sugar                                                          
## chlorides                                                               
## free.sulfur.dioxide                                                     
## total.sulfur.dioxide                                                    
## density                                                                 
## pH                             0.01416                                  
## sulphates                      0.01421 0.01394                          
## alcohol                       0.005594 0.01408   0.01429                
## quality                        0.01294 0.01415   0.01425 0.01158        
## free.sulfur.dioxide_over50ppm  0.01305 0.01429   0.01424  0.0134 0.01429
## bound.sulfur.dioxide           0.01065 0.01429   0.01403 0.01169 0.01361
##                               free.sulfur.dioxide_over50ppm
## X                                                          
## fixed.acidity                                              
## volatile.acidity                                           
## citric.acid                                                
## residual.sugar                                             
## chlorides                                                  
## free.sulfur.dioxide                                        
## total.sulfur.dioxide                                       
## density                                                    
## pH                                                         
## sulphates                                                  
## alcohol                                                    
## quality                                                    
## free.sulfur.dioxide_over50ppm                              
## bound.sulfur.dioxide                                 0.0133
## 
## n = 4898 
## 
## P-values for Tests of Bivariate Normality:
##                                        X fixed.acidity volatile.acidity
## X                                                                      
## fixed.acidity                 1.384e-135                               
## volatile.acidity                4.43e-79     8.326e-51                 
## citric.acid                   8.099e-177    7.094e-126        3.11e-162
## residual.sugar                1.269e-153    3.961e-142       3.871e-146
## chlorides                              0             0                0
## free.sulfur.dioxide            2.436e-59     9.489e-44        2.307e-50
## total.sulfur.dioxide           4.165e-65     1.731e-38        3.649e-49
## density                       6.906e-101     2.053e-49        1.458e-45
## pH                             2.823e-57     5.114e-36        2.379e-36
## sulphates                      1.308e-56     1.076e-33        4.068e-33
## alcohol                       3.053e-105     1.172e-74        1.458e-96
## quality                                0             0                0
## free.sulfur.dioxide_over50ppm  2.436e-59     9.489e-44        2.307e-50
## bound.sulfur.dioxide           1.722e-70     2.943e-38        9.033e-43
##                               citric.acid residual.sugar chlorides
## X                                                                 
## fixed.acidity                                                     
## volatile.acidity                                                  
## citric.acid                                                       
## residual.sugar                 6.704e-208                         
## chlorides                               0              0          
## free.sulfur.dioxide            1.481e-110     2.279e-119         0
## total.sulfur.dioxide           2.145e-108     9.659e-122         0
## density                        1.894e-132      3.89e-196         0
## pH                             3.439e-101     1.085e-119         0
## sulphates                      4.195e-103     2.257e-116         0
## alcohol                        6.265e-186     3.624e-202         0
## quality                                 0              0         0
## free.sulfur.dioxide_over50ppm  1.481e-110     2.279e-119         0
## bound.sulfur.dioxide           7.878e-118     1.409e-123         0
##                               free.sulfur.dioxide total.sulfur.dioxide
## X                                                                     
## fixed.acidity                                                         
## volatile.acidity                                                      
## citric.acid                                                           
## residual.sugar                                                        
## chlorides                                                             
## free.sulfur.dioxide                                                   
## total.sulfur.dioxide                    2.231e-30                     
## density                                 1.384e-52            1.193e-28
## pH                                      3.012e-24            3.591e-17
## sulphates                                1.06e-18            6.053e-32
## alcohol                                 9.643e-71            2.343e-57
## quality                                         0                    0
## free.sulfur.dioxide_over50ppm                 NaN            2.231e-30
## bound.sulfur.dioxide                    3.844e-25            3.543e-32
##                                  density        pH sulphates   alcohol
## X                                                                     
## fixed.acidity                                                         
## volatile.acidity                                                      
## citric.acid                                                           
## residual.sugar                                                        
## chlorides                                                             
## free.sulfur.dioxide                                                   
## total.sulfur.dioxide                                                  
## density                                                               
## pH                             1.448e-34                              
## sulphates                      1.796e-35 1.473e-17                    
## alcohol                       3.223e-108 2.598e-62 3.961e-84          
## quality                                0         0         0         0
## free.sulfur.dioxide_over50ppm  1.384e-52 3.012e-24  1.06e-18 9.643e-71
## bound.sulfur.dioxide           5.486e-24 1.779e-15 1.816e-39 2.181e-55
##                               quality free.sulfur.dioxide_over50ppm
## X                                                                  
## fixed.acidity                                                      
## volatile.acidity                                                   
## citric.acid                                                        
## residual.sugar                                                     
## chlorides                                                          
## free.sulfur.dioxide                                                
## total.sulfur.dioxide                                               
## density                                                            
## pH                                                                 
## sulphates                                                          
## alcohol                                                            
## quality                                                            
## free.sulfur.dioxide_over50ppm       0                              
## bound.sulfur.dioxide                0                     3.844e-25

From the above table, residual.sugar, total.sulfur.dioxide and chlorides do not seem to have strong correlations with quality, but they are moderately correlated with alcohol and density, which have relatively strong correlations with quality. I want to look closer at scatter plots involving quality and some other variables like alcohol, density, residual.sugar, total.sulfur.dioxide and chlorides.
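A Pearson correlation matrix like the one above can be approximated with base R's `cor()`. The sketch below uses only the first ten rows of three columns as printed by `str()`, so its coefficients will not match the full-data table; `wq_head` is a hypothetical name for that small sample.

```r
# First ten rows of three columns, copied from the str() output.
wq_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(wq_head, method = "pearson")
round(cor_mat, 3)

# Rank the other variables by the absolute size of their correlation with alcohol.
others <- colnames(cor_mat) != "alcohol"
sort(abs(cor_mat["alcohol", others]), decreasing = TRUE)
```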

Comparing alcohol to quality, the plot suffers from some overplotting. Most wines have an alcohol value between 9 and 13.

## wq$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## wq$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wq$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wq$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wq$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wq$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wq$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

The highest quality score (9) has little alcohol variance.

For quality scores of 5 and above, alcohol rises with quality: the median alcohol goes from 9.5 at quality 5 to 12.5 at quality 9. Below 5 the relationship is the opposite: the median alcohol falls from 10.45 at quality 3 to 9.5 at quality 5.
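Per-quality summaries of this kind can be produced with `by()` and `tapply()`. The sketch below runs on a small made-up data frame (`wq_demo` is a hypothetical name, and its alcohol values were chosen only to echo the quartiles printed above), so it shows the mechanics rather than reproducing the output.

```r
# Made-up rows for illustration; three wines per quality score.
wq_demo <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# Full five-number summary plus mean for each quality score.
by(wq_demo$alcohol, wq_demo$quality, summary)

# Median alcohol per quality score, convenient for spotting a trend.
med <- tapply(wq_demo$alcohol, wq_demo$quality, median)
med
```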

Comparing density to quality, the plot suffers from some overplotting. Most wines have a density between 0.99 and 1.00.

## wq$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## wq$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## wq$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## wq$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## wq$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## wq$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## wq$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

The highest quality score (9) has little density variance.

For quality scores of 5 and above, density generally falls as quality rises: the median density goes from 0.9953 at quality 5 to 0.9903 at quality 9.

As density increases, residual.sugar also increases. The relationship between density and residual.sugar appears to be linear.

As density increases, chlorides also increases. The relationship between density and chlorides appears to be linear, though weak (r = 0.26).

As alcohol increases, total.sulfur.dioxide decreases (r = -0.45), and bound.sulfur.dioxide follows the same pattern (r = -0.43). The relationship between alcohol and bound.sulfur.dioxide appears to be linear.

As alcohol increases, chlorides decreases. The relationship between alcohol and chlorides appears to be linear.

As alcohol increases, density decreases. The relationship between alcohol and density appears to be linear.
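One way to check a linear relationship like this is a simple least-squares fit. The sketch below uses only the first five rows shown by `str()` (the density values there are rounded for display), so the coefficients are illustrative and will differ from a fit on all 4898 wines.

```r
# First five rows of alcohol and density, copied from the str() output
# (density is shown rounded there).
wq_head <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Least-squares line: density as a linear function of alcohol.
fit <- lm(density ~ alcohol, data = wq_head)

coef(fit)               # intercept and slope; a negative slope means density falls as alcohol rises
summary(fit)$r.squared  # share of density variance explained by the line
```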

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates with alcohol (r = 0.44) and density (r = -0.31). Density correlates strongly with alcohol (r = -0.78).

In general, a high quality score is often accompanied by high alcohol and low density. As alcohol increases, density decreases. The relationship between alcohol and density appears to be linear.

The highest quality score (9) has little alcohol variance and density variance.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As density increases, residual.sugar also increases. The relationship between density and residual.sugar appears to be linear. The relationship between density and chlorides is similar.

As alcohol increases, bound.sulfur.dioxide decreases. The relationship between alcohol and bound.sulfur.dioxide appears to be linear. The relationship between alcohol and chlorides is also similar.

What was the strongest relationship you found?

The strongest relationship is between density and residual.sugar, which are positively correlated (r = 0.84). Alcohol correlates negatively with bound.sulfur.dioxide (r = -0.43) and chlorides (r = -0.36), but these relationships are weaker. Because alcohol and density are the variables most correlated with quality, residual.sugar, bound.sulfur.dioxide and chlorides could also be useful in a model to predict the quality of wines.

Multivariate Plots Section

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A quality score of 5 or 6 often appears at an alcohol value between 9 and 10. A quality score of 7 or 8 often appears at an alcohol value between 12 and 13. Wines with very low or very high quality are rare.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A quality score of 5 or 6 often appears at a density between about 0.992 and 0.997. A quality score of 7 or 8 often appears at a density between about 0.990 and 0.994 (both ranges taken from the per-quality quartiles above). Wines with very low or very high quality are rare.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High-quality wines tend to have high alcohol and low density. Wines with low total.sulfur.dioxide and low chlorides are also more likely to get a high quality score.

Holding density constant, wines with lower residual.sugar almost always score lower in quality than wines with higher residual.sugar (the worst quality score is 3 and the best is 9).
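"Holding density constant" can be approximated by comparing wines only within narrow density bands. The sketch below is one possible way to do that in base R; the rows in `wq_demo`, the band edges, and the high/low split at each band's median are all assumptions made for illustration, not the method behind the plots.

```r
# Made-up rows for illustration only; not the wine data.
wq_demo <- data.frame(
  density        = c(0.991, 0.991, 0.992, 0.992, 0.996, 0.996, 0.997, 0.997),
  residual.sugar = c(1.2, 4.0, 1.5, 5.0, 6.0, 14.0, 7.0, 15.0),
  quality        = c(5, 7, 5, 6, 5, 6, 4, 6)
)

# Bucket density so wines are compared only within a narrow band.
wq_demo$density_band <- cut(wq_demo$density, breaks = c(0.990, 0.994, 0.998))

# Within each band, flag wines above that band's median residual sugar.
band_median <- ave(wq_demo$residual.sugar, wq_demo$density_band, FUN = median)
wq_demo$sugar_level <- ifelse(wq_demo$residual.sugar > band_median, "high", "low")

# Mean quality for each combination of density band and sugar level.
res <- aggregate(quality ~ density_band + sugar_level, data = wq_demo, FUN = mean)
res
```

Comparing the "high" and "low" rows inside the same band shows whether sugar still separates quality once density is roughly fixed.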

Were there any interesting or surprising interactions between features?

Wines with high sugar and low salt (chlorides) tend to get a high quality score. This resonates with me because I think the flavor of wines does influence wine quality.

Wines with low bound.sulfur.dioxide and high alcohol tend to get a high quality score. Although SO2 is mostly undetectable in wine, it still seems to influence people when they judge wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The distribution of residual.sugar appears to be bimodal on a log scale, perhaps because buyers purchase wines in two different ranges of sweetness. Some prefer high sweetness, while others prefer low sweetness.

Plot Two

Description Two

There is a negative relationship between density and alcohol: as alcohol increases, density decreases, and the relationship appears to be linear. This makes sense, since the density of wine is close to that of water and shifts with the percent alcohol and sugar content. It can also explain why a high quality score is usually accompanied by low density and high alcohol.

Plot Three

Description Three

The plot indicates that a linear model could be constructed with quality as the outcome variable and residual.sugar as the predictor variable. Holding density constant, wines with lower residual.sugar almost always score lower in quality than wines with higher residual.sugar (the worst quality score is 3 and the best is 9).


Reflection

The wines data set contains information on 4898 white wines with 13 variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables and found a linear relationship between sweetness and quality.

There was a clear trend between the alcohol and density of a wine and its quality. The relationship between alcohol and density also supports this trend. I was surprised that even though SO2 (bound.sulfur.dioxide) is mostly undetectable in wine, it still somehow influences people’s judgement about wine quality.

I struggled to understand the right skew in the residual.sugar histogram. I used log10 to transform the residual.sugar values, and then I found that residual.sugar appears bimodal, peaking around 2 and again around 9. Then I realized that this makes sense, because it likely reflects people’s preferences in wine flavor.

There are some limitations in the source of this data. The majority of wines are scored 5 to 7, and there are few very high quality or very low quality wines in this dataset. If I used this dataset to build a model to predict wine quality, the result might not be accurate enough. Given that the wines date to 2009, more features or other ingredients that influence wine quality may have been discovered since, so the factors provided in this dataset may be insufficient. To investigate this data further, I would examine the relationships among other features and build a predictive model. I would be interested in testing a linear model to predict current wine quality and in determining how accurate the model is at predicting the quality score. A more recent dataset would be better for predicting wine quality, and comparisons could be made with other linear models to see whether other variables account for wine quality.
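The model test mentioned above could start from a hold-out evaluation: fit a linear model on part of the data and measure the error on the rest. The sketch below is a minimal version on synthetic data. The relationship between `alcohol` and `quality`, the 80/20 split, and the use of RMSE as the accuracy measure are all assumptions for illustration, not results from the wine data.

```r
set.seed(1)
n <- 200

# Synthetic data: quality loosely increasing with alcohol (assumed relationship).
demo <- data.frame(alcohol = runif(n, min = 8, max = 14))
demo$quality <- 1.5 + 0.4 * demo$alcohol + rnorm(n, sd = 0.7)

# Hold out 20% of the rows for testing.
train_idx <- sample(n, size = 160)
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

# Fit on the training rows, predict the held-out rows.
fit  <- lm(quality ~ alcohol, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows, in quality-score units.
rmse <- sqrt(mean((test$quality - pred)^2))
rmse
```

With the real data, the same split would also show how poorly the rare scores (3, 4, 8, 9) are predicted, which is the imbalance concern raised above.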